In 1912, the RMS Titanic struck an iceberg on its maiden voyage and sank, resulting in the deaths of most of its passengers and crew. In this project, we explore the RMS Titanic passenger manifest to investigate which factors determined whether someone survived. The dataset contains demographics and passenger information for 891 of the 2224 passengers and crew on board the Titanic. It was obtained from Kaggle (https://www.kaggle.com/c/titanic/data).
In [1]:
import numpy as np
import pandas as pd
from IPython.display import display
%matplotlib inline
# Load the dataset
files = 'titanic_data.csv'
data_titanic = pd.read_csv(files)
display(data_titanic.head())
From a sample of the RMS Titanic data, we can see the various features present for each passenger on the ship:
Variable Notes
pclass: A proxy for socio-economic status (SES)
age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5
sibsp: The dataset defines family relations in this way... Sibling = brother, sister, stepbrother, stepsister Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way... Parent = mother, father Child = daughter, son, stepdaughter, stepson Some children travelled only with a nanny, therefore parch=0 for them.
In [2]:
data =data_titanic
# Show the dataset
display(data.head())
data.info()
From the info() output above, we can see that the columns Age, Cabin and Embarked have missing values.
Options for handling the missing values:
Ignore the rows with missing data,
exclude the variable altogether, or substitute the missing values with the mean or median.
Age: about 80% of the values are available, and it seems an important variable, so it is not excluded.
Embarked: the port of embarkation does not seem relevant to this analysis, so it is excluded.
Cabin: only about 23% of the values are available, so it is excluded.
PassengerId, Name, Ticket and Fare do not seem to contribute to any survival investigation, so they are excluded as well.
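The availability figures quoted above can be computed directly instead of being read off info(). The following is a minimal sketch on a small made-up frame (the `toy` values are invented for illustration and are not taken from the Titanic data). On the real data the same expression would presumably be applied to `data_titanic`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real passenger data
toy = pd.DataFrame({
    "Age": [22.0, None, 30.0, None, 40.0],
    "Cabin": [None, None, None, "C85", None],
    "Sex": ["male", "female", "female", "male", "male"],
})

# Fraction of non-missing values per column (1.0 means fully populated)
available = toy.notnull().mean()
print(available)
```

A column with a low fraction, like Cabin here, is a candidate for exclusion, while a mostly populated one is a candidate for imputation.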
In [3]:
# Excluding some columns
del data['Ticket']
del data['Cabin']
del data['Embarked']
del data['Name']
del data['PassengerId']
del data['Fare']
In [4]:
data.describe(include='all')
Out[4]:
In [5]:
# Calculate number of missing values
data.isnull().sum()
Out[5]:
In [6]:
null_female = data[pd.isnull(data['Age'])]['Sex'] == 'female'
null_male = data[pd.isnull(data['Age'])]['Sex'] == 'male'
print("Total missing age for female: {}".format(null_female.sum()))
print("Total missing age for male: {}".format(null_male.sum()))
Let's decide whether to remove the rows with a missing age or to fill in the missing values. First, I split the sample into two groups, passengers with a missing age and passengers with a known age, and perform a t-test on their survival.
In [7]:
notnull_age = data[pd.notnull(data['Age'])]
null_age = data[pd.isnull(data['Age'])]
Hypothesis
H0: passengers with a missing age and passengers with a known age have the same mean survival rate. If H0 holds, the rows with a missing age could be dropped without biasing the survival analysis.
If the resulting p-value is less than the significance level (alpha = 0.05), I should reject the null hypothesis and conclude that the two means differ by more than chance. In that case, dropping the rows with a missing age (almost 20% of the data) would distort the results, so the missing values should be filled in instead.
I'm using the existing scipy.stats function ttest_ind to perform a t-test for two independent samples:
In [8]:
from scipy.stats import ttest_ind
ttest_ind(notnull_age['Survived'], null_age['Survived'])
Out[8]:
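The p-value is less than 0.05, which results in rejecting H0, so there is a significant difference in mean survival between the two groups. So, instead of dropping these rows, I'm going to substitute the missing ages with the median age of the passenger's sex and class (falling back to the overall mean age).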
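Because Survived is a 0/1 variable, comparing two mean survival rates amounts to comparing two proportions. As a cross-check on the t-test, here is a minimal sketch of a two-proportion z-test using only the standard library. This is a different test from ttest_ind, swapped in for illustration. The counts are hypothetical, not the Titanic figures, and the function name `two_proportion_z_test` is my own.

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for equality of two proportions (pooled variance)."""
    p_a = successes_a / float(n_a)
    p_b = successes_b / float(n_b)
    pooled = (successes_a + successes_b) / float(n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 40 of 100 survived in one group, 25 of 100 in the other
z, p_value = two_proportion_z_test(40, 100, 25, 100)
alpha = 0.05
print("z = {:.3f}, p = {:.4f}, reject H0: {}".format(z, p_value, p_value < alpha))
```

With samples of this size the z-test and the t-test should lead to the same decision, since the t distribution is close to the normal distribution.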
In [9]:
print("Median age by Sex and Pclass:")
# Group by sex and class and take the median age, so that each NaN can be replaced with the value for its group
print(data.groupby(['Sex', 'Pclass'], as_index=False)['Age'].median())
print("First 5 passengers with a missing age:")
print(data.loc[data['Age'].isnull(), ['Age', 'Sex', 'Pclass']].head(5))
# Apply transformation: missing Age values are filled with the median of their (Sex, Pclass) group
data['Age'] = data.groupby(['Sex', 'Pclass'])['Age'].transform(lambda x: x.fillna(x.median()))
# Show rows 5, 17, 19, 26 and 28 again after filling
print(data.loc[[5, 17, 19, 26, 28], ['Age', 'Sex', 'Pclass']].head(5))
# Fallback: fill any age that is still missing (a group with no known ages) with the overall mean
data['Age'] = data['Age'].fillna(data['Age'].mean())
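To make the group-wise fill easier to follow, here is a minimal sketch on a made-up frame (the `toy` rows are invented for illustration). It computes the median age of each (Sex, Pclass) group with transform and uses it only where Age is missing.

```python
import pandas as pd

# Hypothetical toy data: one missing age in each (Sex, Pclass) group
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, None, 50.0, 20.0, 30.0, None],
})

# Median age of each row's own group, aligned to the original index (NaN is skipped)
group_median = toy.groupby(["Sex", "Pclass"])["Age"].transform("median")

# Keep known ages and fill only the gaps with the group median
toy["Age_filled"] = toy["Age"].fillna(group_median)
print(toy)
```

Selecting the Age column before calling transform keeps the result a single Series, which is what an assignment to one column expects.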
In [10]:
data.describe(include='all')
Out[10]:
We can see that all columns now have the same count, so no missing values remain.
In [11]:
data_s=data
survival_group = data_s.groupby('Survived')
survival_group.describe()
Out[11]:
From the above statistics
In [12]:
# Ages below 1 are fractional (see the variable notes), which is why values such as 0.42 appear
data_s[data_s['Age'] < 1]
Out[12]:
These must be infants under one year old, and all of them survived.
In [13]:
import matplotlib.pyplot as plt
import seaborn as sns
# Set style for all graphs
#sns.set_style("light")
#sns.set_style("whitegrid")
sns.set_style("ticks", {"xtick.major.size": 8, "ytick.major.size": 8})
In [14]:
def plot(a, i):
    # Plots in matplotlib reside within a Figure object; plt.figure() creates a new one
    fig = plt.figure()
    # An empty figure cannot be drawn on, so add a single subplot (axes) to it
    ax = fig.add_subplot(1, 1, 1)
    # Histogram of column `a` with `i` bins
    ax.hist(data[a], bins=i)
    plt.title(a + ' distribution')
    plt.xlabel(a)
    plt.ylabel('Passenger Count')
    plt.show()
In [15]:
plot("Age", 30)
print("The above distribution of Age deviates a little from a normal distribution")
print("")
plot("SibSp", 8)
print("The above distribution of SibSp is right-skewed")
plot("Parch", 6)
print("The above distribution of Parch is right-skewed")
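The skewness described in the printed notes above can also be checked numerically instead of by eye. This is a minimal sketch using the pandas Series.skew method on made-up values (the `toy_counts` series is hypothetical, chosen to have a long right tail like SibSp or Parch).

```python
import pandas as pd

# Hypothetical counts with a long right tail, similar in shape to SibSp or Parch
toy_counts = pd.Series([0, 0, 0, 0, 0, 1, 1, 2, 5])

# Sample skewness: positive for right-skewed data, zero for symmetric data
skewness = toy_counts.skew()
symmetric_skewness = pd.Series([1, 2, 3, 4, 5]).skew()
print(skewness, symmetric_skewness)
```

On the real data the same call would presumably be `data["SibSp"].skew()` and `data["Parch"].skew()`.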
In [16]:
sns.factorplot(x="Sex", y="Age", data=data_s, kind="box", size=7, aspect=.8)\
.set_xticklabels(["Male","Female"])
plt.title('Boxplot of Age grouped by sex')
print("From the below plot we can see there are more elderly men than women and the median age for men is higher than for women")
Gender played an important role in the survival of each individual:
Female Survival rate : 74.2%
Male Survival rate: 18.8%
In [17]:
sns.factorplot(x="Pclass", y="Age", data=data_s, kind="box", size=7, aspect=.8)\
.set_xticklabels(["1","2","3"])
plt.title('Boxplot of Age grouped by class')
print("From the below plot we can see the median age decreases from class 1 to class 3")
From the above plot we can clearly see how the ages of individuals are distributed within each class. The line inside each box marks the median age for that class.
In [18]:
sns.factorplot( 'Sex' , 'Survived', data = data, kind = 'bar')
plt.title('Histogram of Survival rate grouped by Sex')
print("From the plot we can clearly see that the survival rate of females is very high")
In [19]:
## GENDER
survivals = pd.crosstab([ data_s.Sex], data_s.Survived.astype(bool))
survivals.plot(kind='bar', stacked=False)
plt.ylabel("Passenger count")
plt.title('Histogram of Passenger count grouped by Sex and survived')
survival = data_s.groupby('Sex')['Survived']
survival.mean()
Out[19]:
In [20]:
#PCLASS
survivals = pd.crosstab([data_s.Pclass], data_s.Survived.astype(bool))
survivals.plot(kind='bar', stacked=True)
plt.ylabel("Passenger count")
plt.title('Histogram of Passenger count grouped by Class')
survival = data.groupby('Pclass')['Survived']
survival.mean()
Out[20]:
A passenger in Class 1 was about 2.5 times as likely to survive as a passenger in Class 3.
Socio-economic standing was a factor in the survival rate of passengers.
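A "times as likely" figure of this kind is simply the ratio of two group survival rates. Here is a minimal sketch of that arithmetic on made-up rows (the `toy` frame is hypothetical and deliberately gives a round ratio, so it does not reproduce the Titanic value).

```python
import pandas as pd

# Hypothetical toy data: 3 of 4 survive in class 1, 2 of 8 survive in class 3
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0],
})

# The mean of a 0/1 column is the survival rate of each class
rates = toy.groupby("Pclass")["Survived"].mean()

# Ratio of the class 1 rate to the class 3 rate
ratio = rates.loc[1] / rates.loc[3]
print(rates)
print("Class 1 vs class 3 ratio: {:.2f}".format(ratio))
```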
In [30]:
survivals = pd.crosstab([data_s.Pclass, data_s.Sex], data_s.Survived.astype(bool))
survivals.plot(kind='bar', stacked=True)
survive=data.groupby(['Sex','Pclass'])
plt.ylabel("Passenger count")
plt.title('Histogram of passenger count grouped by sex and Class')
#survive.Survived.sum().plot(kind='barh')
survive.mean()
Out[30]:
From the above plot we can see that females were given first preference, and that survival also depended on class.
Socio-economic standing was a factor in the survival rate of passengers within each gender:
Class 1 - male survival rate: 36.89%
Class 2 - female survival rate: 92.11%
Class 2 - male survival rate: 15.74%
Class 3 - female survival rate: 50.0%
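Percentages like the ones listed above can be laid out as a class-by-sex table in one step. This is a minimal sketch on made-up rows (the `toy` frame is hypothetical, so its numbers are not the Titanic rates). It groups by both keys, averages Survived, and pivots Sex into columns.

```python
import pandas as pd

# Hypothetical toy data with two passengers per (Pclass, Sex) combination
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "Sex":      ["female", "female", "male", "male",
                 "female", "female", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Survival rate per (Pclass, Sex), as a percentage, with one column per sex
rates = toy.groupby(["Pclass", "Sex"])["Survived"].mean().unstack("Sex") * 100
print(rates)
```

Reading the table row by row shows the effect of class, and reading it column by column shows the effect of sex.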
In [23]:
#Age
sns.factorplot(x="Survived", y="Age", hue='Sex', data=data_s, kind="box", size=7, aspect=.8)\
.set_xticklabels(["Expired","Survived"])
plt.title('Boxplot of Age grouped by sex and Survival')
# survive_A=data.groupby(['Sex','Age'])
Out[23]:
From the above boxplot and calculated mean:
In [32]:
#Age
# We are dividing the Age data into 3 buckets of (0-18),(18-40),(40-90)
# and labeling them as 'Childs','Adults','Seniors' respectively
data['group_age'] = pd.cut(data['Age'], bins=[0,18,40,90], labels=['Childs','Adults','Seniors'])
data.head(5)
survive_a=data.groupby(['group_age'])
survival_a = pd.crosstab([data.group_age], data_s.Survived.astype(bool))
survival_a.plot(kind='bar', stacked=True)
plt.title('Bar plot of Passenger count grouped by age categories ')
plt.ylabel("Passenger count")
# sns.factorplot(x="group_age", y="Age", hue='Sex', data=data, kind="box", size=7, aspect=.8)
survive_a['Survived'].mean()
Out[32]:
These are the survival rates (as fractions) for each group_age bucket.
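It helps to see exactly where pd.cut places the boundary ages. By default the intervals are closed on the right, so with bins [0, 18, 40, 90] the buckets are (0, 18], (18, 40] and (40, 90]. The sketch below uses a few made-up ages (the `ages` series is hypothetical) with the same bins and labels as the cell above.

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges
ages = pd.Series([0.42, 18.0, 18.5, 40.0, 41.0, 80.0])

# Same bins and labels as in the analysis; intervals are right-closed by default
buckets = pd.cut(ages, bins=[0, 18, 40, 90], labels=['Childs', 'Adults', 'Seniors'])
print(pd.DataFrame({"Age": ages, "group_age": buckets}))
```

An 18-year-old therefore lands in 'Childs' and a 40-year-old in 'Adults'.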
In [25]:
def group(d, v):
    # Classify a passenger as 'child' (under 18), 'Woman' or 'Man' from sex `d` and age `v`
    if (d == 'female') and (v >= 18):
        return 'Woman'
    elif v < 18:
        return 'child'
    elif (d == 'male') and (v >= 18):
        return 'Man'
data['Category'] = data.apply(lambda row: group(row['Sex'], row['Age']), axis=1)
data.head(5)
Out[25]:
In [26]:
survival_a = pd.crosstab([data.Category], data_s.Survived.astype(bool))
survival_a.plot(kind='bar', stacked=True)
plt.ylabel("Passenger count")
plt.title('Survival by Age category')
data.groupby('Category')['Survived'].mean()
Out[26]:
Women and children were given priority, which shows in their higher survival rates.
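The same three categories can also be assigned without a row-by-row apply. This is a minimal sketch using nested numpy.where calls on a made-up frame (the `toy` rows are hypothetical). It assumes Age has no missing values and Sex is only 'male' or 'female'. Unlike a row-wise function, it would label a missing age as an adult.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two adults and two children
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Age": [30.0, 25.0, 10.0, 5.0],
    "Survived": [0, 1, 1, 1],
})

# Under 18 is a child regardless of sex; otherwise split adults by sex
toy["Category"] = np.where(
    toy["Age"] < 18,
    "child",
    np.where(toy["Sex"] == "female", "Woman", "Man"),
)

# Survival rate per category
category_rates = toy.groupby("Category")["Survived"].mean()
print(toy)
print(category_rates)
```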
In [27]:
g = sns.factorplot(x="Category", y="Survived", col="Pclass", data=data,
saturation=.5, kind="bar", ci=None, size=5, aspect=.8)
# Fix up the labels
(g.set_axis_labels('', 'Survival Rate')
.set_xticklabels(["Men", "Women","child"])
.set_titles("Class {col_name}")
.set(ylim=(0, 1))
.despine(left=True, bottom=True))
print('Bar plot of survival rate grouped by age category and class:')
In [28]:
# We are dividing the Age data into 3 buckets of (0-18),(18-40),(40-90)
# and labeling them as 'Childs','Adults','Seniors' respectively
data['group_age'] = pd.cut(data['Age'], bins=[0,18,40,90], labels=['Childs','Adults','Seniors'])
#finding mean Survival rate grouped by 'group_age','Sex'.
df = data.groupby(['group_age','Sex'],as_index=False).mean().loc[:,['group_age','Sex','Survived']]
f, (ax1, ax2,ax3) = plt.subplots(1, 3,figsize=(15,7))
g = sns.barplot(x="group_age", y="Survived", hue="Sex", data=df,ax=ax1)
ax1.set_title('Survival by Age and Sex')
#finding mean Survival rate grouped by 'group_age'.
data2 = data.groupby(['group_age'],as_index=False).mean().loc[:,['group_age','Survived']]
h = sns.barplot(x="group_age",y='Survived', data=data2,ax=ax2)
ax2.set_title('Survival by Age')
#counting the passengers in each 'group_age' bucket.
data3 = data.groupby(['group_age'],as_index=False).count().loc[:,['group_age','Survived']]
hh = sns.barplot(x="group_age",y='Survived', data=data3,ax=ax3)
ax3.set_title('Age distribution in Ship')
ax3.set_ylabel('Age Distribution')
for ax in f.axes:
    plt.sca(ax)
    plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
In [29]:
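# Mean survival rate per (Category, Pclass); DataFrame.sort was removed from pandas, so use sort_values
data_C = data.groupby(['Category', 'Pclass'])['Survived'].mean()
data_C.sort_values()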
Out[29]:
From the above values we can see that the survival rate increases from top to bottom. From the plot we can see the distribution of survival rate among men, women and children, by class.
We observe the following order of survival rate based on age, sex and class, from highest to lowest:
children and women of upper class
children and women of middle class
women of lower class
children of lower class
men of upper class
men of the middle and lower classes, who have the lowest survival rate
The analysis suggests that females with upper socio-economic standing (first class) and children had the best chance of survival. Age alone did not seem to be a major factor. Men in third class had the lowest chance of survival. Women and children of all classes mostly had a higher survival rate than men.
Limitations: